Introduction to ggplot2
The Grammar of Graphics
ggplot2 is one of the core packages of the tidyverse, a collection of R packages designed for data science.
The “gg” stands for “Grammar of Graphics”, a book by Leland Wilkinson that offers tools to concisely describe the components of a statistical graphic.
The logic of ggplot2 stems from the idea that you can build every graph from the same few components: a data frame, visual marks (geoms) that represent the data, and a coordinate system.
ggplot2 is more flexible and versatile than R’s base graphics, and once you get a grip on the syntax and function arguments, it becomes easy to create beautiful and elaborate visualizations.
As Hadley Wickham explained: “You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.”
Grammatical elements of ggplot2
A key feature of ggplot2 is that it lets you layer graphical elements on top of each other. You can imagine a stack of layers, each adding to the layers before it.
- Data - the data frame we want to use for our plot
- Aesthetics (aes) - the scales we want to map our data onto
- Geometries (geom) - the geometrical shapes representing our data
- Themes - the appearance of the non-data aspects of the plot
- Statistics - data representations to aid understanding
- Coordinates/Scales - the range and limits of our plot
- Facets - the layout of multiple plots and subplots
The first three elements (data, aesthetics, and geometries) are the basic elements. We must define them in order to produce a meaningful plot: the data and aesthetics usually go in the ggplot() call, and the geometries are added as geom layers.
The remaining elements are “optional”, that is, they are set to a default. This means we are not required to define them when we plot, though typically we would want to adjust them to make sure our graphs better fit our needs.
In this presentation I will focus mainly on the first three elements, and specifically on the most commonly used geoms.
Let’s get to work!
Installing packages
Begin by installing and loading the tidyverse package, which includes ggplot2 along with other useful packages such as dplyr and tidyr, which are used for manipulating data prior to plotting.
You only need to install a package once, but you need to “load” it every time you start a new session.
If you only want to install ggplot2, you can use a similar line of code, but you will most likely use dplyr as well, so you may as well install tidyverse, which includes both (and more).
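The installation and loading steps described above might look like the following sketch. The install.packages() calls are commented out on the assumption that the packages are already installed; uncomment the one you need the first time.

```r
# Install once (uncomment on first use).
# install.packages("tidyverse")
# Or, to install ggplot2 on its own:
# install.packages("ggplot2")

# Load at the start of every session.
library(tidyverse)
```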
Our Data
For this exercise we will use the diamonds dataset, which ships with ggplot2, and the gapminder dataset from the gapminder package. Both are freely available in R.
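A minimal sketch of making both datasets available, assuming the gapminder package has already been installed (the install line is commented out):

```r
# install.packages("gapminder")  # run once if the package is missing

library(ggplot2)    # provides the diamonds dataset
library(gapminder)  # provides the gapminder dataset
```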
## Warning: package 'gapminder' was built under R version 3.6.3
We will start working with the diamonds data.
The first step should always be to examine the dataset. What variables do we have? What data type is each variable? How many observations are included?
You can use the structure function str(), or the summary function summary() if you want more details on each variable.
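The inspection calls that produce output like the listing below could be written as follows. This is a sketch; head() is used here only to keep the printout short.

```r
library(ggplot2)

# First ten rows of the data.
head(diamonds, 10)

# Compact structure: one line per variable with its type.
str(diamonds)

# Summary statistics for each variable.
summary(diamonds)
```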
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
If you only want the variable names for easier access, you can simply list the columns of the dataset using colnames().
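For example, on the diamonds data:

```r
library(ggplot2)

# List the column names of the diamonds data frame.
colnames(diamonds)
```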
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
Now let’s continue exploring the data by plotting it with ggplot2.
The ggplot2 syntax
The first line of code in ggplot2 requires us to input the data frame we intend to use and the aesthetics we want to map our data onto. This line typically includes all the data needed for creating the plot. The function syntax is written as: ggplot(data, aes())
For instance, to plot the price of diamonds based on their carat we need to set “diamonds” as the data, and map “carat” and “price” onto the x and y aesthetics.
The function can be written either as: ggplot(data = diamonds, aes(x = carat, y = price)) or simply as: ggplot(diamonds, aes(carat, price))
This creates the base layer of our plot, which includes the dimensions we defined for the aesthetics. In order to present the observations, we need to add geometric layers. For every layer we add, we need to place a “+” sign.
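As a small sketch of the base layer on its own: printing it draws the axes for carat and price but no observations, because no geom has been added yet. The object name base is only for illustration.

```r
library(ggplot2)

# Data and aesthetic mappings only; no geom layer yet.
base <- ggplot(data = diamonds, aes(x = carat, y = price))

# Printing shows empty axes scaled to carat and price.
base
```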
For instance, to present a trend line of the average price by carat, we can add a geom_smooth() layer. This geom draws a smoothed trend line with a confidence interval around it.
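A sketch of that plot, using the default smoothing method that ggplot2 picks for the data (the object name p_smooth is only for illustration):

```r
library(ggplot2)

# Base layer plus a smoothed trend line with its confidence band.
p_smooth <- ggplot(diamonds, aes(carat, price)) +
  geom_smooth()

p_smooth
```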
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
However, a regression line is not very telling about the observations. In this instance, it would make more sense to create a scatterplot in order to see the spread of the observations. We can do this by adding a geom_point() layer.
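The scatterplot version could look like this (p_points is an illustrative name):

```r
library(ggplot2)

# One point per diamond: carat on the x axis, price on the y axis.
p_points <- ggplot(diamonds, aes(carat, price)) +
  geom_point()

p_points
```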
Scatterplots
Many of the observations overlap, making it difficult to see the actual distribution. To help remedy overplotting, we can adjust the transparency of the points by reducing the alpha, and also increase the size of the points, inside the geom_point() layer.
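A sketch with the alpha and size values that the later examples in this tutorial use (0.4 and 2); other values work just as well:

```r
library(ggplot2)

# alpha < 1 makes points semi-transparent; size enlarges them.
p_alpha <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.4, size = 2)

p_alpha
```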
This looks better, but it is still difficult to draw insights from this plot. We can add another aesthetic mapping to differentiate between diamonds with different cuts. In this example we will map “cut” onto the color aesthetic in the ggplot() line.
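Mapping cut to color in the main ggplot() call might look like this:

```r
library(ggplot2)

# color = cut inside aes() gives each cut level its own color and a legend.
p_color <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)

p_color
```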
We could also set the aesthetic mapping inside the geom layer rather than in the ggplot() line. We might choose to do so if we want to assign different aesthetic mappings to different geom layers, or if we are plotting values from different data frames.
In the example above, if we move the aes(color = cut) into the geom_point() layer, we will produce the same graph.
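A sketch of the same plot with the color mapping moved into the geom layer:

```r
library(ggplot2)

# The color mapping now belongs to geom_point() only,
# so other layers added later would not inherit it.
p_layer_aes <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = cut), alpha = 0.4, size = 2)

p_layer_aes
```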
Finally, notice the difference between aesthetic mappings, which are tied to variables and represented by scales, and attributes, which are fixed values.
Instead of assigning a fixed size, we could map size onto a variable.
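To make the distinction concrete, here is a sketch of both variants. Mapping size to depth is an arbitrary choice for illustration; any variable could be used.

```r
library(ggplot2)

# Attribute: size is a fixed value, set outside aes().
p_fixed <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)

# Mapping: size varies with a variable, set inside aes().
p_mapped <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(aes(size = depth), alpha = 0.4)

p_mapped
```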
Stacking Layers
As mentioned previously, ggplot2 enables us to add multiple layers on top of each other.
We need to add a “+” sign at the end of each line to indicate that another layer follows.
When adding geom layers, each new layer will appear on top of the previous layers. This means the order of the geom layers matters.
ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Notice that the aesthetic mappings defined in the first line are automatically adopted by all the geom layers. Aesthetics and attributes added to an individual geom layer affect only that layer, and they can override aesthetic mappings from the main ggplot() line.
ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_smooth(color = "deeppink3")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2) +
  # for lines, ggplot2 >= 3.4.0 prefers linewidth = 2 over size = 2
  geom_smooth(aes(linetype = cut), size = 2, se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We can add multiple geom layers of the same type.
# adding two geom_smooth() layers
ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2) +
  geom_smooth(color = "deeppink3", se = FALSE) +
  geom_smooth(color = "blue", method = lm) +
  ylim(0, 20000)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 38 rows containing missing values (geom_smooth).
Each geom type has multiple arguments that are set to default values, which we can easily change based on our needs. For instance, geom_point() understands the aesthetics x, y, alpha, color, fill, shape, size, and stroke.
In the previous examples we changed the alpha and size attributes of the points.
For a cheat sheet of ggplot2 geom arguments by RStudio, visit this link.
Improving the plot
Before we continue, let’s make our lives a bit easier. Instead of typing the function over and over again, we can assign the plot to an object and simply add layers to that object.
## assigning a ggplot2 plot to an object
dd <- ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) + geom_point(alpha = 0.4, size = 2)
# same as: dd <- ggplot(diamonds, aes(carat, price, color = cut)) + geom_point(alpha = 0.4, size = 2)
Now we can add layers and adjustments to “dd”, which already contains our predefined ggplot() + geom_point().
Vertical & Horizontal Lines
We can add lines to indicate the median and mean of carat. To add vertical and horizontal lines we use geom_vline() and geom_hline(), respectively.
dd +
  geom_vline(xintercept = 0.7, linetype = "dashed", color = "#b22222") +
  geom_vline(xintercept = 0.7979, linetype = "dashed", color = "turquoise4")
We can also add labels to the lines with geom_text() to indicate what they represent.
dd +
  geom_vline(xintercept = 0.7, linetype = "dashed", color = "#b22222") +
  geom_vline(xintercept = 0.7979, linetype = "dashed", color = "turquoise4") +
  geom_text(aes(x = 0.7, label = "carat median", y = 14000), vjust = -1, color = "#b22222", angle = 90, size = 3) +
  geom_text(aes(x = 0.7979, label = "carat mean", y = 14000), vjust = 1, color = "turquoise4", angle = 90, size = 3)
We can see that much of the data is condensed on the left side of the plot. We can handle this by adjusting the data or, better yet, adjusting the scale.
Adjusting the data
Using dplyr functions, we can filter out observations greater than 3 carats. We’ll create a new dataset by saving the filtered data into an object called “smallD”.
We then plot the same aesthetics using the new data frame “smallD”.
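A sketch of the filtering and plotting steps, assuming “filter out observations greater than 3 carats” means keeping rows with carat <= 3:

```r
library(ggplot2)
library(dplyr)

# Keep only diamonds of 3 carats or less.
smallD <- diamonds %>%
  filter(carat <= 3)

# Same aesthetics and geom as before, applied to the new data frame.
ggplot(smallD, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)
```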
Adjusting the scales
Instead of filtering out extreme observations, we can adjust the x axis, either by changing its limits with xlim(), or by taking the log of the x scale with scale_x_log10().
Limiting the scale drops the points outside the limit range.
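A sketch of limiting the x axis. The limits 0 and 3 are an assumption chosen to match the 3-carat cutoff used for filtering; the plot object is rebuilt here so the example runs on its own.

```r
library(ggplot2)

# The scatterplot object from earlier, rebuilt for a self-contained example.
dd <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)

# Points with carat outside [0, 3] are dropped, with a warning.
p_lim <- dd + xlim(0, 3)

p_lim
```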
## Warning: Removed 32 rows containing missing values (geom_point).
Limiting the x scale for the diamonds dataset created a graph that is identical to the one we created with the smallD data frame.
Logging the scale keeps all the data points, but spaces the axis logarithmically, which spreads out small values and compresses large ones.
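A sketch of the log-scaled version, again rebuilding the plot object so the example is self-contained:

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)

# All points are kept; only the spacing of the x axis changes.
p_log <- dd + scale_x_log10()

p_log
```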
Logging is useful when the data is very skewed, as in the case of the gapminder data, which I will return to at the end of the presentation. For the diamonds dataset, I would probably choose to limit the axis scale rather than log it.
Facets & Themes
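We can further examine differences by arranging the data into subplots with facet_grid() and facet_wrap().
facet_grid() creates a grid of panels defined by row and column variables, while facet_wrap() wraps a single sequence of panels into rows and columns.
One of the benefits of creating subplots with facets is that, by default, the scales are shared across the plots.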
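A sketch of both faceting functions on the diamonds scatterplot. Faceting by cut and the choice of three columns are illustrative assumptions.

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2)

# One panel per cut, wrapped into three columns.
p_wrap <- dd + facet_wrap(~ cut, ncol = 3)

# One panel per cut, laid out as columns of a single-row grid.
p_grid <- dd + facet_grid(. ~ cut)

p_wrap
p_grid
```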
Now let’s see what happens when we use geom_point() with a categorical x.
There is overplotting because all the observations are aligned on the same x value. ggplot2 enables us to “jitter” the points in order to overcome overplotting. We do this either by adding a geom_jitter() layer instead of a geom_point() layer, or alternatively, by adding a position argument to the geom_point() line as follows:
geom_point(position = "jitter")
ggplot(diamonds, aes(cut, price, color = cut)) +
  geom_point(alpha = 0.4, size = 2, position = "jitter")
We focused on many examples with scatterplots (geom_point()), but the logic of the function arguments and layers is applicable to the other geom types as well.
Bar Charts
In geom_bar(), the height of the bars represents the number of cases in each group. Thus it only takes an “x” aesthetic.
In geom_col(), the height of the bars represents values in the data, which is why it also requires a “y” aesthetic.
Alternatively, you can set the stat argument inside geom_bar() to identity, as in geom_bar(stat = "identity"), which enables it to take a “y” aesthetic as well.
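As an illustration of the difference, the sketch below first computes a value per group (the mean price per cut, an assumption chosen for this example) and then draws it with geom_col() and with the equivalent geom_bar(stat = "identity"). The names avg_price, mean_price, p_col, and p_bar are hypothetical.

```r
library(ggplot2)
library(dplyr)

# One row per cut, holding the value the bar height should show.
avg_price <- diamonds %>%
  group_by(cut) %>%
  summarize(mean_price = mean(price))

# geom_col() uses the y values as they are.
p_col <- ggplot(avg_price, aes(x = cut, y = mean_price)) +
  geom_col()

# Same result with geom_bar() and stat = "identity".
p_bar <- ggplot(avg_price, aes(x = cut, y = mean_price)) +
  geom_bar(stat = "identity")

p_col
```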
Assigning the color aesthetic changes the color of the bar outlines rather than the fill of the bars. To change the color of the bars themselves, we use the fill aesthetic.
ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  # after_stat(count) replaces the older ..count.. notation (ggplot2 >= 3.3.0)
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1) +
  ylim(0, 25000)
By assigning the fill to another variable, we split each bar into subgroups.
The default position is “stack”, which is why the subgroups are stacked on top of each other. The other options are position = "fill", which stretches each bar to represent 100%, and position = "dodge", which places the groups next to each other.
ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge") +
  facet_grid(. ~ clarity) +
  theme(axis.text.x = element_text(angle = 90))
Finally, you can also change the direction of the bars by flipping them 90 degrees with coord_flip(), or create a circular chart with coord_polar().
ggplot(diamonds, aes(x = color, fill = color)) +
  geom_bar() +
  coord_flip() +
  theme(legend.position = "none")
ggplot(diamonds, aes(x = color, fill = color)) +
  geom_bar() +
  coord_polar() +
  theme(legend.position = "none")
Histograms
Histograms are used for continuous x variables, as opposed to bar charts, which are used for categorical variables.
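A sketch of a histogram of diamond prices. The bin width of 500 is an arbitrary assumption; without it, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a better value.

```r
library(ggplot2)

# Count diamonds in price bins that are 500 units wide.
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

p_hist
```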
Boxplots
ggplot(diamonds, aes(color, price, fill = color)) +
  geom_boxplot() +
  labs(title = "My amazing diamond boxplot chart", x = "Diamond Color Grade", y = "Price")
ggplot(diamonds, aes(color, price, fill = cut)) +
  geom_boxplot() +
  labs(title = "Diamond price by color grade and cut", x = "Diamond Color Grade", y = "Price")
Line graphs
Line graphs produced by geom_line() are suitable for longitudinal data in which we want to show change over time or differences between treatments. For the diamonds data, a line graph would look like a hot mess.
To better demonstrate the line geom, we will move to the gapminder data, which contains information on the life expectancy of countries from 1952 to 2007.
The Gapminder data
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
Let’s see what happens when we plot life expectancy by year using geom_line().
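A sketch of that first attempt, assuming the gapminder package is installed:

```r
library(ggplot2)
library(gapminder)

# A single line drawn through every observation, ordered by year.
p_line <- ggplot(gapminder, aes(year, lifeExp)) +
  geom_line()

p_line
```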
This graph is not very telling because it is basically going through all the data points on each year. We can add a color aesthetic to create subgroups for each continent. Let’s check if that helps.
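Adding the color aesthetic for continent could look like this:

```r
library(ggplot2)
library(gapminder)

# One line per continent, each still passing through all of its countries' values.
p_line_color <- ggplot(gapminder, aes(year, lifeExp, color = continent)) +
  geom_line()

p_line_color
```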
It still looks like a mess because the lines are still passing through all the data points of each year. What we want is lines that represent the average of each group, similar to the trend line produced by geom_smooth().
# adding a geom_smooth() layer
ggplot(gapminder, aes(year, lifeExp, color = continent)) +
  geom_line() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
So basically, we need to create another variable that represents the average for each group by year. This is where dplyr becomes important and useful. We can easily create a new data frame from the gapminder data with dplyr functions to add the desired values. We first group by continent and year with group_by(), and then use summarize() to calculate the total population and the average life expectancy.
I saved this data frame in an object called “yearContinent”.
yearContinent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(as.numeric(pop)), AverageLifeExp = mean(lifeExp))
# display the first 10 rows of the new data frame we created
yearContinent
## # A tibble: 60 x 4
## # Groups: year [12]
## year continent totalPop AverageLifeExp
## <int> <fct> <dbl> <dbl>
## 1 1952 Africa 237640501 39.1
## 2 1952 Americas 345152446 53.3
## 3 1952 Asia 1395357351 46.3
## 4 1952 Europe 418120846 64.4
## 5 1952 Oceania 10686006 69.3
## 6 1957 Africa 264837738 41.3
## 7 1957 Americas 386953916 56.0
## 8 1957 Asia 1562780599 49.3
## 9 1957 Europe 437890351 66.7
## 10 1957 Oceania 11941976 70.3
## # … with 50 more rows
Now when we add a geom_line(), the lines represent the average for each continent.
We can also add a geom_point() on top of the lines so that the average value for each year is visually clearer.
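A sketch of the line plot on the summarized data. The summary step is repeated here so the example runs on its own.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# One row per year and continent, with total population and mean life expectancy.
yearContinent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(as.numeric(pop)), AverageLifeExp = mean(lifeExp))

# Each line now connects one average value per year.
p_avg <- ggplot(yearContinent, aes(year, AverageLifeExp, color = continent)) +
  geom_line()

p_avg
```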
ggplot(yearContinent, aes(year, AverageLifeExp, color = continent)) +
  geom_line(aes(linetype = continent)) +
  geom_point(aes(size = totalPop)) +
  geom_hline(linetype = "dashed", color = "#000055", yintercept = 59.47) +
  geom_text(aes(x = 2000, label = "global average life Exp", y = 61), color = "#000055", size = 3)
Final notes
Now let’s go back to the full gapminder data and see what it looks like when we use geom_point() to create a scatterplot by year.
The points line up in columns because year takes only a few distinct values, so it behaves like a categorical variable. Remember the jitter option for geom_point(), which “jitters” the points to avoid overplotting?
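A sketch of both versions. Plotting lifeExp against year and coloring by continent are assumptions about the mappings, chosen to match the earlier line plots.

```r
library(ggplot2)
library(gapminder)

# Plain scatterplot: points stack up in one column per survey year.
p_year <- ggplot(gapminder, aes(year, lifeExp, color = continent)) +
  geom_point()

# Jittered version: a little random noise spreads the points apart.
p_year_jitter <- ggplot(gapminder, aes(year, lifeExp, color = continent)) +
  geom_point(position = "jitter")

p_year
p_year_jitter
```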
We can also facet the data by year to see variations by year.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  facet_wrap(~ year, ncol = 3)
Remember that logging the scales helps when the data is very skewed. This data is much more skewed than the diamonds data.
# logging the x scale
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year, ncol = 3)
Now let’s put everything we learned together.
First we’ll assign the main function and the log scale to an object named “gm” (short for gapminder).
gm <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
Now we will plot the object and add faceting, themes, and labels.
gm +
  facet_grid(continent ~ year) +
  theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none") +
  labs(title = "Life Expectancy by GDP, Continent and Year", x = "GDP", y = "Life Expectancy")
That is all for now.
Tutorials
Continue learning and practicing ggplot2 on your own:
- Data Visualization - in “R for Data Science” - Hadley Wickham’s e-book guide to R
- The Complete ggplot2 Tutorial - Tutorial by Selva Prabhakaran
- Stack Overflow - Great forum for asking questions from the community
- Data Camp course - First lesson of each course is free
- Interactive charts - Convert your ggplot2 figures into interactive ones powered by plotly.js